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Abstract 

Protein folds are highly designable, in the sense that many sequences fold to 
the same conformation. In the present work we derive an expression for the 
designability in a 20 letter lattice model of proteins which, relying only on 
the Central Limit Theorem, has a generality which goes beyond the simple 
model used in its derivation. This expression displays an exponential depen- 
dence on the energy of the optimal sequence folding on the given conformation 
measured with respect to the lowest energy of the conformational dissimilar 
structures, energy difference which constitutes the only parameter control- 
ling designability. Accordingly, the designability of a native conformation is 
intimately connected to the stability of the sequences folding to them. 



I. INTRODUCTION 

Even a quick look at the set of known proteins (PDB database) reveals a striking feature. 
While there are tens of thousands of protein sequences, they only assume some thousands 
folds. In other words, proteins are highly designable. This concept can be quantified by 
measuring the number of sequences that fold uniquely into a particular structure. 
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With the use of a simple 20 letter lattice model |],|2],|3]||] of protein folding it has been 
shown |BJ that the whole issue of estimating the number n of sequences that fold to the 
same conformation is reduced to enumerate how many of them have native energy lying 
below a threshold E c , the energy which any sequence with the same composition displays in 
the conformation structurally dissimilar to the native conformation 

The aim of the present paper is to provide a reliable, analytic expression for n, which we 
shall show increases exponentially with the gap 6 between the native energy of the optimal 
sequence E n and the threshold energy E c . This functional form is found to be universal, 
as it emerges from the Central Limit Theorem. We have furthermore found that while 
the parameters defining n depend on the interaction matrix, they are independent of the 
particular choice made for the native structure or for the optimal sequence. 

In Sections 2 and 3 we briefly review the 20 letters lattice model of proteins in general 
and the question of protein designability in particular. The quantitative, analytic answer to 
the question of how many mutations a designed protein tolerates is given in Section 4. The 
conclusions are collected in Section 5. 

II. LATTICE MODELS 

A useful theoretical approach to study protein folding is provided by a simplified lattice 
model, where the protein is a string of beads that is arranged on a cubic lattice • The 

configurational energy of a chain of N monomers is given by 

1 N 

E = 2 H U m(i),m(j)H\ri -Tj\), (1) 
M 

where U m u\ m (j\ is the effective interaction potential between monomers m(i) and m(j), r*j 
and fj denote their lattice positions and A(x) is the contact function. In Eq. (1) the pairwise 
interaction is different from zero when i and j occupy nearest-neighbour sites, i.e., A (a) = 1 
and A(na) = for n > 2, where a indicates the step length of the lattice. In addition to 
these interactions, it is assumed that on-site repulsive forces prevent two amino acids to 
occupy the same site simultaneously, so that A(0) = oo. 
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We shall consider throughout a 20-letters representation of protein sequence where U is a 
20x20 matrix. A possible realization of this matrix is given in ref. (Table 5 and 6), where it 
was derived from frequencies of contacts between different amino acids in protein structures. 
The employment of a 20 x 20 matrix ensures that the threshold energy E c is well defined, 
depending only on the interaction matrix elements and on the composition of the protein in 
terms of amino acids. The model we study here is a generic heteropolymer model which has 
been shown to reproduce important generic features of protein folding thermodynamics and 
kinetics, independent on the particular potential chosen 0,0]. However, in using such an 
approach, one should keep in mind that the labelling of amino acids (spherical beads all of 
the same size and with no side chain) is generic too and may be no obvious relation between 
those labels and labels for real amino acids. 

Good-folder sequences are characterized by a large gap 5 = E c — E n (compared to the 
standard deviation a of the contact energies) between the energy of the designed sequence 
in the native conformation E n , and the lowest energy of the conformations structurally 
dissimilar to the native conformation |I|P,P|,fi,|r9 . In other words, good folders are associated 



with a normalized gap £ = 8/a ^> 1, quantity closely related to the z-score For 



example, the 36mer sequence listed in the caption to Fig. 1 and called S 36 in the literature 
|T^ , P^JT5| , p!6|Jl7| , p!8|| , designed by minimizing the energy in the target (native) conformation 



with respect to amino acid sequence for fixed composition has, in the units considered here 
(RT room = 0.6 kcal/mol 0), an energy gap 5 = 2.5 and thus a sufficiently large value of £ 
(= 2.5/0.3 ~ 8.33) so as to ensure fast folding. In fact, Monte Carlo simulations carried out 
at the temperature T = 0.28 of 3000 36mers with energies, in the native conformation, lying 
inside the gap fold in times < 7 x 10 7 MC steps |L9| (For caveats see ref. p0f). In particular 
S 36 folds in 6.5 x 10 6 MC steps. 

It has been also shown that most of the thermodynamical |TB| and dynamical ][T9|JT3 



behaviour of designed proteins is controlled by only 5 — 10% of the sites. As a consequence, 



making mutations in these sites, which are called "hot" in ref. [|18||, one destroys, as a rule, the 



ability the protein has to fold (denaturation). On the other hand, the effects of substitutions 



in any other site (that can be regarded as "cold") are small, leading to neutral mutations. 



III. DESIGN ABILITY WITH 20 LETTERS MODELS 

While twenty letters heteropolymers capture the essential components of real proteins, it 
is hardly possible to enumerate all sequences which have a given conformation as their non- 
degenerate ground state. Accordingly, it is not possible to calculate the exact designability 
of protein conformations. To bypass this problem, we shall determine designability from 
energetic considerations, using a strategy which relies on the fact that all sequences which 
have an energy lower than E c fold on short call, in any case in times which are much shorter 
than that associated with the random search ||. 

Any sequence of a given length iV (e.g. N = 36) can be obtained making m < N 
mutations (i.e. substitution of a amino acid in a given site with a different one) in the 
minimum energy sequence (e.g. S 36 in the case of Fig. 1(a) ). Consequently, the designability 
of a conformation can be found starting from the minimum energy sequence, counting how 
many mutations lay within the gap 5 = E c — E n . If AE is the change in the energy 
of the native state produced by a mutation, p m (AE) the energy distribution probability 
associated with m mutations and n^f the total number of sequences that can be produced 
by introducing m mutations in the minimum energy sequence, designability can be defined 
as 

N 

n= ^2n m , (2) 

m=l 

n m = n t ° t f S p m (AE)d(AE). (3) 
Jo 

So far, we have done nothing more than expressing the problem in another way, since to 
know the spectrum of mutation energies of the optimal sequence one has again to enumerate 
all sequences. In fact, it looks like as if we have made things even worse, in that now one 
has to find the optimal sequence, which is a non-trivial problem, and also has to ensure that 
E c does not change with mutations. 
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We shall show in the following that the distribution of mutation energies does not depend 
on the particular structure or on the particular sequence chosen (provided that E <C E c ) 
nor on the contact energy matrix used to design the protein, but only on its composition 
and on the number of contacts (or the length, if it is fully compact). This observation 
leaves room to approximations. In fact, if one is able to find an approximate expression for 
p(AE), such an expression will hold for all model proteins of the same length. Furthermore, 
the knowledge of the sequence associated with the global minimum of energy E n is not 
necessary (because all sequences have the same spectrum of mutations), only the value of 
E n is required. Consequently, even if the optimal sequence cannot be known without a full 
enumeration of all sequences, it is allowed to use any other sequence with energy E ~ E n , 
introducing in this way only an error in the integration boundaries (and not on the function 
to be integrated). It is then possible to calculate designability of a structure from Eq. (2-3) 
using an approximate distribution p(AE) and an approximate value of S. 

The most conservative way to calculate the number of sequences which fold to a confor- 
mation is then to use a distribution p m (AE) found only by swappings between the residues 
of the optimal sequence, as in such a way the composition is conserved and E c does not 
change. On the other hand, since there are also sequences with different composition fold- 
ing to the same conformation, one is also forced, in principle, to calculate the number n 
associated with pointlike mutations. The values found from swapping of amino acids and 
from pointlike mutations can be viewed as the lower and the upper limit to designability, 
respectively. 

In Figs. 2(a) and 2(b) we display the unnormalized energy distribution probabilities 
associated with two composition-conserving and with two pointlike mutations of S36 (the 
integral of these distributions being the total number of sequences). Each of these curves 
can be well fitted by the sum of two Gaussians, whose means are AE 2 = 1.2 and AE 2 = 3.0 
(Fig. 2(a), composition conserving case) and AE 2 = 1.1 and AE 2 = 3.6 (Fig. 2(b), pointlike 
mutations case). Standard deviations are a 2 = 0.7 and o 2 = 1.0 (Fig. 2(a) ) and o 2 = 0.7 
and a 2 = 1.1 (Fig. 2(b) ). The behaviour of these two distributions seems very alike, except 

5 



for the fact that the area below the composition-conserving curve is much smaller than that 
below the pointlike mutations curve. This is because much fewer mutations can be made 
in the first than in the second case and, consequently, the associated Gaussian behaviour is 
less well defined. 

The overall structure of the curves shown in Fig. 2 can be understood from the fact that 



the average value of AE for "cold" sites is 0.65 and for "hot" sites is 2.87 [lq|. Accordingly, 
the low-energy peak can be associated with two mutations in "cold" sites, while the high- 
energy peak can be associated with a mutation in a "cold" site and a mutation in a "hot" 
site. The contribution from mutations in two "hot" sites leads to an enhancement of the 
high energy tail of the curve. Concerning the Gaussian behaviour, we note that the energies 
associated with the 19 possible mutations on a given "cold" site are uncorrelated. In other 



words, one has to pay an energy AE 2 /2 pa 0.6 (concerning the factor 1/2 one is reminded of 



the fact that AE 2 gets contributions from two mutations) to remove the wild-type residue, 
reflecting the fact that it has been optimized. Second, one has to introduce a new residue 
in the niche left by the wild-type residue. The Gaussian shape of the distribution suggests 
that the niche is neutral with respect to the new residue and that the new interactions 
are merely random. To be more precise, the change in energy AE upon mutations is 
the difference between the energy needed to remove the original residue (which is roughly 
constant and assumes two different values for cold and for hot sites) and the sum of a number 
of contact energies associated with the new residue, energies which can be considered as 
random numbers. Being the sum of random numbers, the energy associated with the new 
residue is forced to respect the Central Limit Theorem, and consequently its distribution 
approaches a Gaussian function. Of course, an exact Gaussian distribution could be reached 
only if the number of nearest neighbours of each site were infinite (while in a cubic lattice 
this is, at most, five). On the other hand, the fact that p m (AE) can be accurately fitted by 
Gaussian distributions (cf., e.g., Fig. 2(b) ) testifies to the fact that we are not far from the 
conditions in which the Central Limit Theorem holds. 

While the hypothesis that "cold" mutations give raise to Gaussian-like peaks is quite 
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grounded, due to the uncorrelateness of the energy contributions of "cold" sites, it is unlikely 
that the Central Limit Theorem works properly for "hot" sites, whose energy contributions 



are correlated |19[| . In order to calculate the degree of designability of a protein conformation, 
we only need to know the contributions from cold sites and, consequently, we don not need 
to better characterize the peaks associated to "hot" sites. 

We have found that the distribution of mutation energies are rather universal functions. 
Examples of such a behaviour are shown in Fig. 3, where 2-pointlikes mutations spectra 
P2(AE) associated with low energy 36mer sequences optimized (making use of the MJ el- 
ements of Table 6) on three different conformation (cf. Figs. l(a)-l(c) ) and with three 
sequences designed on the same conformation (Fig. 1(a) ) are displayed. Similar results have 
been obtained for chains of different lengths. Furthermore, using different 20 x 20 interaction 
matrices lead to the same Gaussian behaviour of p2(AE), although the mean values and the 
standard deviations are different. This is again a consequence of the Central Limit theorem. 
This can be seen from Fig. 4, where we display the function p 2 (AE) associated with two 
pointlike mutations on S36 (cf. Fig. 1), but making use this time of the interaction matrix 
elements listed in Table 5 of ref. |J. Because the average change in energy upon mutations 
in cold sites is zero, while that in hot sites is 0.35, it is easy to identify the peaks associated 



with two cold mutations (AE2 = and <j 2 = 0.34), with one cold and one hot mutations 



(AE 2 = 0.35 and cr 2 = 0.02), and with two hot mutations (AE 2 = 0.70 and a 2 = 0.22). 

Summing up, the function p 2 (AE) associated with chains of different length and sequence 
as well as designed on different native conformations overlap quite nicely, suggesting that 
the spectrum of both composition conserving and non-conserving mutations is universal. 



On the other hand, the actual value of AE 2 and 02 characterizing the different peaks of the 
energy distribution probability depend on the matrix used to describe the contact energies 
among the amino acids. 

The univerality of the energy distribution probability is in agreement with the inter- 
pretation of the main peaks of the spectrum of mutations of a designed protein in term of 
"cold" and "hot" sites. In fact, the properties of the "hot" sites are rather homogeneous, 



their contribution to the mutation spectrum being universal. Assuming furthermore that 
the interactions associated to the mutations in "cold" sites are random, the resulting energy 
distribution is Gaussian (and so universal), and its standard deviation depends only on the 
interaction matrix, while its average value depends on the degree of optimization of the 
wild-type monomer. This, in turn, can be approximated by the degree of optimization of 
the whole chain (measured by the energy gap 5) divided by its length, a quantity which 
is essentially constant for long chains |24| (for example, in the case of S36 this number is 



The basic idea to calculate the designability of a model protein, as we have discussed 
above, is to find a simple approximation to the universal distribution of energies associated 
with mutations onto the optimal sequence, integrate this distribution up to the gap 8 and 
normalize this result to the total number of mutations that one can make (cf. Eqs. (2-3) ). 
As a consequence of this, designability turns out to depend only on the length of the protein 
(through the total number of mutations) and on the gap S. 

In order to calculate n using Eq. (2-3) we have first to know the total number n 1 ^ of 
sequences that can be obtained by making m sequence-conserving mutations (swappings) 
in the optimal sequence. This number can be obtained by counting the number of ways one 
can select m sites, multiplied by the number of permutations of these sites which move all 
the m residues. That is 



where P«(m) is the number of ways one can permute m sites in such a way that only i 
positions are kept fixed. From the relation 



2.5/36 = 0.07). 



IV. HOW MANY GOOD FOLDERS? 




(4) 




(5) 



it is possible to extract the expression for P , 
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m ' 2 (m\ 

P {m) =m\- £ ^ k jPo{m-k)-l. (6) 

For large m, one can use the Stirling approximation for the factorials in Eqs. (4-6), and 
keep only the largest exponent term in the sum (saddle point approximation), obtaining 
n tot eX p(Q, m ). The constant a can be determined from the relation n 1 ^ 1 = e aN = N\, 
which for N = 36 leads to a = 2.66. 

To proceed further in the calculation of n, one needs to find a simple approximation to 
p m (AE). For this purpose, we shall express the energy distribution of an arbitrary number of 
mutations as a convolution of functions p2{AE) associated with the swapping of two amino 
acids. The validity of this approximation rests on the ansatz that every couple of mutations 
affect the energy of the native state independently of the other couple of mutations. This 
approximation is expected to work also for large values of m, where the probability of 
mutating neighbouring sites is not negligible, because the contact energy associated with 
the mutated residues are in any case random quantities with average zero (cf. the discussion 
in the previous section). Within this scenario, the number of folding sequences displaying 
2m mutations and whose energy in the native conformation lies inside the energy gap can 
be written as 

n 2m «< / dE / dAE x dAE 2 ...dAE m _ 1 x 

JO J~oo 

xp 2 (AE 1 ) p 2 {AE 2 ) ... p 2 {AE m ^) p 2 {AE - AE t - AE 2 - ... - AE m ^). (7) 

Making use of the energy distribution probability associated with an amino acid swapping 
(composition conserving mutations) or with two point-like mutations (composition non- 
conserving mutations) one obtains the lower and the upper limit of the designability of a 
conformation. 

In what follows we shall essentially discuss the case of composition conserving mutations. 
If 5 is lower than the peak associated with mutations in "hot" sites (as in the case of the 
sequence S 36 where 5 = 2.5, cf. Fig. 2), one should convolute only the peak of p 2 (AE) 
associated with mutations in "cold" sites ||25|| . Exploiting the fact that the convolution of 
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m Gaussian distributions, of the form exp((A.E — AE 2 ) 2 /2(a 2 ) 2 ) is a Gaussian distribution 
with average AE 2m = mAE 2 and standard deviation a 2m = m x l 2 a 2 , it is possible to write 



« 2m « n'^vrmV 2 ) 1/2 exp ( m ) x 



7A / AS 2 AE 2 AE\ 

For m 3> (5/(2o"2) 1//2 (in the case of S36 this condition means m ^> 2) one can neglect the 
first exponential factor in the integral, in which case the integration can be carried out 
analytically, leading to 



n m « rtf (nm 2 a 2 2 /2)^ 2 exp ( ^ (exp 



AE 2 

70 



1 (9) 



.2(a 2 ) 

where the substitution 2m — *> m has been made. This equation tells us that designability 
increases exponentially with the gap S. In other words, the number of sequences folding to a 
(compact) conformation is determined only by the gap associated with the minimum energy 
sequence. 

We have shown that the concepts of designability (i.e. number of sequences folding to 
a given conformation) and foldability (i.e., thermodinamical stability of the sequences with 
low energy on the given conformation, expressed by the gap S) are intimately connected 
by Eq. (|9|). If a protein is stable in its native conformation, such native conformation is 
necessarilly highly designable. Vice versa, if a conformation is highly designable, there exist 
sequences with a large gap folding to it. 

To give a numerical evaluation of protein conformations, we make use of Eq. (^)) in the 
form 

-2 



— exp 



m=i m 



AE- 



(10) 

where k does not depend on m and, for the case of the structure displayed in Fig. 1(a), 



assumes the value k = 17 (in keeping with the fact that 5 = 2.5, AE 2 = 1.2 and a 2 = 0.7). In 



the case in which a > AE 2 /A{a 2 ) 2 1 which in the case of the 36mer under discussion implies 
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a > 0.73, one can keep only the largest term in the above sum. Within this approximation 
one can write n « e L90x36 = 0.6 • 10 30 , a number to be compared with n 1 ^ = 3.72 ■ 10 41 . 

One can mention, for the sake of completeness, that the number of sequences within the 
gap obtained by pointlike mutations (which is the upper limit to designability) , is well fitted 
by the function 4exp(5m), while the total number of sequences is 19 m (^J. 



V. CONCLUSIONS 

The degree of designability of a given conformation depends exponentially on the energy 
gap 5. Since the number of folding sequences is given by the integral of a universal function 
(the mutation energy distribution) carried up to 5, a quantity which also determines the 
thermal stability of the designed protein, one can conclude that designability and thermal 
stability are strongly interconnected. In other words, sequences displaying large gaps are 
both thermally stable and highly designable. Even sequences displaying, in the native confor- 
mation, a small gap fold on short call and share (in the compaction process) the conserved 
contacts leading to local elementary structures and to the (post-critical) folding nucleus 
19| , |26|| . Consequently, it is possible to obtain from them, through composition-conserving 



mutations, other sequences folding to the same native conformation and displaying a large 
gap. In other words, any sequence able to fold fast, folds to a highly designable conformation. 

We have estimated that there are of the order of 10 30 sequences folding to a compact 
36mer conformation, over a total of 10 41 . This is only the lower limit, but let us assume that 
it describes well the typical degree of designability of the designed protein. Is this number 
small or big? The answer to this question has, of course, important implications from the 
evolutionary point of view. If good folders were distributed homogeneously in the space of 
sequences (like in the case of RNA |27|| ) the important parameter would be their density, that 
is 10~ n . This number would be very low, preventing sequences from moving along neutral 
pathways (which are collections of sequences folding to the same conformation and differing 
by single mutations). Such a scenario is very unfavourable for evolution. The situation is 
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however quite different for proteins. In fact, it has been shown p8| that good folders group 
themselves in clusters and superclusters, giving rise to a quite an inhomogeneous landscape. 
Consequently, the relevant parameter which measures the designability of a conformation 
is the total number of sequences which conserve, in any way, the energy gap. This number 
(> 10 30 ) is very large in particular in keeping with the fact that over a life span of the order 
of 60 mutations occur in the genome of each person ||29|| ). 
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FIGURES 

FIG. 1. Three conformations used as natives in the present study. Sequence S3 6, which is a 
good folder onto structure (a), is SQKWLERGATRIADGDLPVNGTYFSCKIMENVHPLA. 

FIG. 2. Energy distribution for 2 composition-conserving (a) and pointlike (b) mutations. The 
parameters of the Gaussian fit (dotted line) are given in the text. 

FIG. 3. The distribution of energies associated to two composition-conserving mutations made 
on three sequences designed on three different 36mers conformations, (b) The spectrum obtained 
making two composition-conserving mutations on three sequences designed on the same confor- 
mation (the one displayed in Fig. 2.4). The values of the energy gap are 5 = 2.5 (dotted curve), 
5 = 1.6 (solid line) and 5 = 1.3 (dashed line). 

FIG. 4. The distribution p(AE) associated with two pointlike mutations for the structure 
displayed in Fig. 1(a) when the monomers interact with the matrix listed in Table 5 of ref. || 
(instead of Table 6). The dashed line is the Gaussian fit obtained with the least-square method. 
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